The information presented here is placed in the public domain, and was written by Doug Burke. The notebook used to create this page is available (empty or filled), and questions can be asked using the GitHub issues page or via Twitter: https://twitter.com/doug_burke.

It will probably make a lot more sense if you have already read the previous entries in this "series".

Reading in Astronomy data

Today I'm going to read in some Astronomy data and plot it. This will require manually parsing a FITS (Flexible Image Transport System) binary table file$^\dagger$; it was going to be more automated, but the post was getting too long so I decided to leave that part for another day and just stick with the manual code below (one of my current pet ideas is to provide a simple FITS back end to the Frames library for reading in data from FITS files, so maybe I'll write something up soon if I get time to work on it).

I have updated to the development version of IHaskell, rather than the version on Hackage, to see if it fixed some issues I was seeing (which it did). One of the changes this has made is to jump to IPython 3.0 - a.k.a. the Jupyter project - which will hopefully not be a problem for people playing along at home.

$^\dagger$ We normally use the acronym - i.e. FITS - and forget about the fact that we are using what is ostensibly an image format to read in tabular data.

Last time the notebook was run


In [1]:
import Data.Time
getCurrentTime


2015-04-05 20:57:31.597228 UTC

What on earth is an ARF?

The file I want to display contains an "effective area" curve for a source observed by the Chandra X-ray Observatory. This curve is known, to X-ray astronomers, as an ARF (Auxiliary Response File)$^\dagger$. I can use one of the tools that we provide as part of CIAO$^\ast$ to display some important information about the file.

$^\dagger$ Wow, a TLA that isn't on Wikipedia!

$^\ast$ We in the Science Data Systems group in the CXC are particularly proud of this contrived acronym, even if we can't distinguish between a language and a dialect.

The dmlist tool is used to display the structure and contents of the file formats we use when analyzing X-ray data:

% dmlist src.arf blocks

--------------------------------------------------------------------------------
Dataset: src.arf
--------------------------------------------------------------------------------

     Block Name                          Type         Dimensions
--------------------------------------------------------------------------------
Block    1: PRIMARY                        Null        
Block    2: SPECRESP                       Table         3 cols x 1070     rows

So, there are two "blocks" of data in the file, with the first one being empty. This is actually a consequence of the I in FITS, where the first block$^\ddagger$ can either be empty or contain image data. A result of this is that tabular data is always relegated to the second (or later) block.

We can also see what columns are stored in the table:

% dmlist src.arf cols

--------------------------------------------------------------------------------
Columns for Table Block SPECRESP
--------------------------------------------------------------------------------

ColNo  Name                 Unit        Type             Range
   1   ENERG_LO             keV          Real4          -Inf:+Inf            Min Energy
   2   ENERG_HI             keV          Real4          -Inf:+Inf            Max Energy
   3   SPECRESP             cm**2        Real4          -Inf:+Inf            Effective Area

and even see some of the data (which will be useful later, when I want to check that I've actually read things in properly):

% dmlist "src.arf[#row=1:5]" data

--------------------------------------------------------------------------------
Data for Table Block SPECRESP
--------------------------------------------------------------------------------

ROW    ENERG_LO             ENERG_HI             SPECRESP

     1     0.30000001192093     0.31000000238419         4.8743429184
     2     0.31000000238419     0.31999999284744        14.8292617798
     3     0.31999999284744     0.33000001311302        21.3022918701
     4     0.33000001311302     0.34000000357628        28.5149517059
     5     0.34000000357628     0.34999999403954        35.3988304138

$^\ddagger$ I really should use the term HDU - which means Header/Data Unit - rather than block, but in CIAO we tend to use the term block, and as the Chandra X-ray Center pay my bills I'm sticking with this terminology.

So, how can I read in the data? Well, I happen to know that the data is stored in blocks of text (US ASCII) for the metadata, interspersed with binary blocks for the actual data. More detailed information is available in the FITS primer, and the FITS standard will also be required reading (if you are having trouble sleeping). I'll just add a note to say that in Astronomy we use the FITS standard but then layer on more domain-specific standards (occasionally these are "best practices" rather than any real form of a standard) to enhance the data representation (in particular for the metadata). This isn't really important for what I'm about to do, but I thought I'd mention it. There are also addenda to the standard, since it has changed over time, but they aren't important for today's work.

I'll start by loading in the file as a bytestring, picking the Char8 variant since the text is in US ASCII format. It would be interesting to do this via a streaming interface, but for this example that would be overkill.


In [2]:
import qualified Data.ByteString.Char8 as B8

There are several representations for text-like data in Haskell: the language standard defines the String type, then there are ByteString and Text values which are more efficient (and have subtly-different use cases). This means that we end up with a bunch of routines that do the same thing, and often have the same name, but work on different data types$^\dagger$. For example, there's a readFile function in the standard Haskell prelude$^\ddagger$, but this returns a String, so I need to use the ByteString version, which I indicate using the prefix I just defined: B8.readFile.

$^\dagger$ There are enough semantic differences between the operations that a universal "string-like" typeclass hasn't become popular in the community.

$^\ddagger$ the prelude is the name for the "default set of symbols" you get to use in Haskell (as it's a module name I should refer to it as Prelude).
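
To make the difference concrete, the two functions differ only in their result type (the exact output of :type may vary with the GHC/IHaskell version, but it should look something like this):

:type readFile
-- readFile :: FilePath -> IO String

:type B8.readFile
-- B8.readFile :: FilePath -> IO ByteString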


In [3]:
cts <- B8.readFile "../data/src.arf"
:type cts


cts :: ByteString

Let's see how large the file is (since this is the Char8 version of a bytestring, we don't have to worry about non-eight-bit characters):


In [4]:
B8.length cts


34560

Now, FITS files are written in chunks of 2880 bytes - that is, the file is padded with trailing space characters (or null bytes) if needed - so how many chunks are there?


In [5]:
len = B8.length cts
len / 2880


No instance for (Fractional Int) arising from a use of ‘/’
In the expression: len / 2880
In an equation for ‘it’: it = len / 2880

Argh!!! This type-safe language has stopped me from making a type error and it's annoying! The length function returns an Int but the division operator / requires a Fractional constraint, and the error is telling me that integers (or, at least, values with a type of Int) aren't fractional. What I need is to use functions from the Integral typeclass: in this case I'm going to use divMod to return both the number of blocks and any remainder.
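
As an aside, another way to keep the type checker happy would be to explicitly convert the length to a Fractional type first (a quick sketch, not what I do below):

fromIntegral len / 2880 :: Double
-- 12.0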


In [6]:
len `divMod` 2880


(12,0)

So, the file contains 12 chunks. As a reminder, or perhaps I haven't mentioned it before, binary functions such as divMod can be written in "infix" form by surrounding their name with backticks; that is, the above is equal to

divMod len 2880

I tend to use the infix form if it makes it clearer (to me) what the arguments mean. There are also those binary functions - such as + - which are defined as being "infix" so do not need the backticks; however, when used in prefix form they must be surrounded by ():


In [7]:
1 + 2
(+) 1 2


3
3

Since the value 2880 is going to be used a few times below, let's "name" it:


In [8]:
chunk :: Int
chunk = 2880

Now, FITS files start with a Header Unit, that is 2880*n characters of ASCII text (where n is an integer), and these header units are broken up into 80-character records (each known as a "card"). So, let's look at the contents of the first card:


In [9]:
B8.take 80 cts


"SIMPLE  =                    T / file does conform to FITS standard             "

Each card can be considered to be a keyword/value pair with an optional comment (or description). The values are typed, in the sense that there are strings, booleans, integers, or floats, but there's no syntactic constraint that a given key has a specific type. The layout is pretty simple:

  • name of the keyword in the first 8 characters (which can only be A-Z, 0-9, - or _; I was under the impression that the name had to start with A-Z but this may not be the case, or it's clarified in some later document)

  • characters 9 and 10 are "= " (actually, there's a few informational keywords that don't have this, but I'm just going to ignore those today, since they are metadata about the metadata)

  • a string representation of the value: a character string, boolean, integer, floating point, or complex number

  • an optional description (or comment) which is started with the / character.

A more detailed description of these elements can be found in the FITS standard.

At this point you should be wondering why Astronomers use a format that requires them to write their own parser. Well, we do have standard parsers - a recent entrant is the astropy Python package and the most-used compiled version is the CFITSIO library - but I wanted to explore some simple parsing in Haskell.

Let's start by splitting up the input into cards, that is, 80-character chunks, using splitAt:


In [10]:
-- As a reminder, type introduces a synonym which can be used to help document a
-- function (in this case, which part of the return is the card and which the
-- remaining input), but it provides no "extra" type safety, since it is
-- perfectly fine to treat a Card as a ByteString (since that is what it is),
-- or vice versa.
--
type Card = B8.ByteString

-- | Given a string, split off the next 80 characters and return the remaining values.
getCard :: B8.ByteString -> (Card, B8.ByteString)
getCard = B8.splitAt 80

Here I've taken advantage of partial application to define getCard. I could have explicitly included the input variable, as in the following definition

getCard bs = B8.splitAt 80 bs

but it's common to see the partially-applied version (one argument for this form is that you focus more on the code because you aren't distracted by variables that aren't actually "doing anything"). This approach is so common that the "standard" Haskell linting tool will complain if it thinks you've not been eta-reducing enough. The notebook runs this linter by default, so we can see an example of this suggestion:


In [11]:
test bs = B8.splitAt 80 bs


Eta reduce
Found:
test bs = B8.splitAt 80 bs
Why Not:
test = B8.splitAt 80

where the Eta reduce term comes from the lambda calculus, and is mentioned in this StackOverflow question. Note that if I had given test a signature, as I did with getCard, then the suggestion would not have been created (and I'm too lazy to find out why).

As a quick check of getCard, here's the first card, where I use pattern matching to deconstruct the return value (i.e. separate out the pair), and the '_' syntax tells the compiler that I am not interested in the second element of this tuple.


In [12]:
(a,_) = getCard cts
a


"SIMPLE  =                    T / file does conform to FITS standard             "

Since pairs are a common form of tuple, there are standard routines to extract the first or second element of a pair: they are called fst and snd respectively, since vowels are apparently in short supply!


In [13]:
:type fst
:type snd


fst :: forall a b. (a, b) -> a
snd :: forall a b. (a, b) -> b

Using the fst function, I can rewrite the above to avoid having to even talk about the second value returned by getCard:


In [14]:
fst (getCard cts)


"SIMPLE  =                    T / file does conform to FITS standard             "

The getCard function isn't that useful on its own. What I really want is something that splits the input every 80 characters, returning an array of values$^\dagger$. For this example I want to use a sequence to represent the array (rather than the standard Haskell list), since it should be more efficient for some of the data access patterns we would want when accessing the file metadata$^\ddagger$. As with string-like operations, containers (such as lists, sequences, and vectors) share many function names, so I use a qualified import of the Data.Sequence module to avoid name clashes. I then explicitly import several symbols that do not clash, to avoid having to prefix them with "Seq.", as Seq.|> and Seq.>< aren't the easiest symbols to read.

$^\dagger$ There is the Haskell split package which provides many routines for splitting up lists, but it's also easy to write your own, as I do here.

$^\ddagger$ This is why I also import the symbols ViewR, |>, ><, and viewr, which involve creating and breaking-up sequences.


In [15]:
import qualified Data.Sequence as Seq
import Data.Sequence (ViewR(..), (|>), (><), viewr)

I want a function that will take a bytestring, split off the first 80 characters, and then call itself on the remaining string (in Haskell it's common to use recursion when you'd use the equivalent of a Python for loop). Here's an initial version which uses getCard to break down the input, stopping when it is empty. The syntax acc |> card adds$^\dagger$ card onto the end of the sequence acc, so the go step adds on the current card to the end of the accumulator (the first argument) until the input string is empty, at which time the sequence is returned.

$^\dagger$ Actually, it creates a new sequence - formed by combining the inputs - rather than modifying the contents of the card sequence. That is, it is not Python's acc.append(card) but more like newacc = acc[:]; newacc.append(card), but without the need to actually copy data, due to the way data is "shared" in Haskell.
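
Here is a tiny illustration of that sharing, with made-up values that have nothing to do with the ARF parsing:

s1 = Seq.fromList [1, 2, 3]
s2 = s1 |> 4
(s1, s2)
-- (fromList [1,2,3],fromList [1,2,3,4])  -- s1 is left untouched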


In [16]:
getCards1 :: B8.ByteString -> Seq.Seq Card
getCards1 bs = go Seq.empty bs
  where
    go acc xs | B8.null xs = acc
              | otherwise  = let (card,todo) = getCard xs
                                 newacc = acc |> card
                             in go newacc todo

The function above introduces guards, that is the syntax

name args | conditional1 = value1
          | conditional2 = value2
          ...
          | otherwise    = valueN

The terms to the right of the | character (i.e. conditional1...) are the guards, and act like a giant if statement, in that they are checked from top to bottom, with the first condition that evaluates to True "winning" (the otherwise symbol is just another name for True, so it acts like the default label in C-style case statements).

This is an example of a routine for which "eta reduction" makes sense, in that the bs argument sent in to getCards1 is only used in the initial call to go, but I left it in for clarity (and because I'm about to show a different version which uses the eta-reduced form!). In this new version, I take advantage of the unfold pattern (postscript version) - which is available for most Haskell container data types - to deal with the building of the sequence. Here I use Seq.unfoldr, which is given a routine that deconstructs the bytestring (i.e. it returns the next element and the remainder of the bytestring). This sounds a lot like getCard, although it's a slightly-modified version, as shown below, due to how unfoldr is defined:


In [17]:
getCards :: B8.ByteString -> Seq.Seq Card
getCards = Seq.unfoldr go
  where
    go bs | B8.null bs = Nothing
          | otherwise  = Just (getCard bs)

To explain this a bit further, let's look at unfoldr:


In [18]:
:type Seq.unfoldr


Seq.unfoldr :: forall b a. (b -> Maybe (a, b)) -> b -> Seq a

The first argument is a function which takes a value (of type b), and returns a value with a Maybe type. This is used in Haskell to represent optional values, since there are two cases:

  • Nothing, to represent the empty (or missing, or "null", or ...) case
  • Just v, which indicates that there is a result, with a value v

In this case, the function uses Nothing to indicate that the recursion is to stop, otherwise it returns a pair of values as Just (a,b), where a is the value to store and b is the new "seed" value (i.e. input to the next call of the function).

Let's try with a simple example: a routine that is given an integer and returns the value converted to a string, along with the initial value reduced by 1. To indicate it has finished, the routine will stop (i.e. return Nothing) if the integer value is less than 1.


In [19]:
silly :: Int -> Maybe (String, Int)
silly a = if a < 1 then Nothing else Just (show a, a-1)

Here's several examples of its behavior:


In [20]:
silly 5
silly 1
silly 0


Just ("5",4)
Just ("1",0)
Nothing

When used with unfoldr, we get a routine that takes an integer and returns a sequence of strings:


In [21]:
:type Seq.unfoldr silly


Seq.unfoldr silly :: Int -> Seq String

Hopefully, if I've been able to explain this, the following output should come as no surprise (well, apart from the fromList part, which is part of the Show representation of a Sequence):


In [22]:
Seq.unfoldr silly 5


fromList ["5","4","3","2","1"]

With that digression out of the way, let's get back to some parsing. As the data comes in chunks of 2880 bytes, let's just try with the first chunk (take returns the first n bytes of the string):


In [23]:
cards = getCards (B8.take chunk cts)
:type cards


cards :: Seq Card

In [24]:
Seq.length cards


36

I can use array indexing to access the contents:


In [25]:
Seq.index cards 0


"SIMPLE  =                    T / file does conform to FITS standard             "

In [26]:
Seq.index cards 1


"BITPIX  =                   16 / number of bits per data pixel                  "

In [27]:
Seq.index cards 2


"NAXIS   =                    0 / number of data axes                            "

As NAXIS=0, there's no data associated with this block, just metadata. The question is, just how much metadata? Now, due to the way that the header units are defined, there are three cases for the last card in a chunk of 2880 bytes:

  • a normal keyword
  • the END keyword
  • 80 spaces as a "filler" card after the END keyword (the standard says that the space character should be used but occasionally you find files that don't match the standard; fortunately I don't have to worry about that complication today).

This means that all I need to do is check the first four characters of the last card: if it's "END " or " " then it's the end of the header data. I can use the viewr function to extract the last card (this is one of the advantages of using a sequence over a Haskell list, namely efficient access to both ends of the data). The return value has the type ViewR which can either be empty (EmptyR), or lcards :> lelem, where lelem is the last element and lcards are all the cards preceding lelem.


In [28]:
lcards :> lelem = viewr cards

With this, the last card of this block is:


In [29]:
lelem


"                                                                                "

The above call to viewr - or, more particularly, the deconstruction of the return value using :> - is incomplete, because it does not deal with all possible values. In this particular case we know that cards is not empty, so the answer cannot be EmptyR. When given an arbitrary sequence, the empty case must also be accounted for, and I'll show code later on that handles this case.

This piece of code is a good example showing how Haskell's laziness can result in surprising behavior. It's not obvious that a function has actually been evaluated, even after a statement like

lcards :> lelem = viewr cards

It's a bit like poor old Schrödinger's cat: we won't know the state until the compiler evaluates the answer ("opens the box"). In the code above, it was only when lelem was displayed that we could be sure that the viewr call had "run". As an example of this, note what happens when viewr is called with an empty sequence:


In [30]:
xxx :> yyy = viewr Seq.empty

There's no error at this point. It is only when we do something with one of the values that the error makes itself known:


In [31]:
Seq.length xxx


:1:1-28: Irrefutable pattern failed for pattern xxx Data.Sequence.:> yyy

The Irrefutable pattern failure means that it could not match the expected value of xxx :> yyy with the actual value, which in this case is EmptyR.

Getting back to the header parsing, the fact that lelem is all blank indicates that this is the end of the first block. Let's write a quick function to encode this logic, on the off chance that I may want to use it again...


In [32]:
hasEndCard :: Seq.Seq Card -> Bool
hasEndCard cards = case viewr cards of
  _ :> card -> any (`B8.isPrefixOf` card) [B8.pack " ", B8.pack "END "]
  EmptyR -> True -- treat an empty sequence as indicating the end of a block

I use the case syntax so that I can safely handle an empty input, rather than causing a run-time error (since they're somewhat frowned upon in Haskell!). If the sequence is not empty then I use pattern matching to "deconstruct" the return value; that is, I use the syntax _ :> card to provide a name for the last card and to tell the compiler I don't care about the preceding elements in the sequence (i.e. the "_" part). The check on the last card is performed using the any routine, which has the signature:


In [33]:
:type any


any :: forall a. (a -> Bool) -> [a] -> Bool

As I want to check whether card starts with " " or "END ", the function sent to any takes a prefix and checks whether it matches the start of card, and the list of values contains the prefixes to use. Note that there are several ways this could have been written, for instance being explicit with the functions (rather than using partial application) and moving the pack call into the anonymous function:

any (\prefix -> (B8.pack prefix) `B8.isPrefixOf` card) [" ", "END "]

The last check handles calls when the input sequence is empty, in which case I somewhat arbitrarily decided to return True. Here I was explicit in the case statement; that is, I wrote

EmptyR -> True

but you will often see the "default" case written as

_ -> True

A quick check that it works as expected:


In [34]:
hasEndCard cards


True

So, we can skip to the next chunk, which will be a header unit for the second block, but this time there's more than 2880 bytes of metadata (the drop call ignores the first chunk bytes of cts, so the take/drop calls are equivalent to Python's [chunk:2*chunk] array slice syntax):


In [35]:
cards2 = getCards (B8.take chunk (B8.drop chunk cts))
hasEndCard cards2


False

What I want is a routine that will strip off 2880-byte chunks, convert them to cards, continuing until an END card is found. This sounds like an extension to how getCards works, in that it requires repeated application of the routine. It would also be nice if it returned the unused data, since we will want that when extracting the data values.


In [36]:
getUnits :: B8.ByteString -> (Seq.Seq Card, B8.ByteString)
getUnits = go Seq.empty 
  where
    -- splitAt will work if the input text is smaller than the chunk
    -- size, but I include the check to make the code a bit clearer
    -- in intent (also, if the file is a valid FITS file then this
    -- first check should never be fired).
    --
    go cards bs | B8.length bs < chunk = (cards, bs)
                | otherwise  = let (ls,rs) = B8.splitAt chunk bs
                                   newCards = getCards ls
                                   -- the >< operator appends the two sequences together
                                   combined = cards >< newCards
                               in if hasEndCard newCards
                                  then (combined, rs)
                                  else go combined rs
                                  
(units, bdata) = getUnits (B8.drop chunk cts)

In [37]:
Seq.length units
Seq.index units 0
Seq.index units 1


216
"XTENSION= 'BINTABLE'           / binary table extension                         "
"BITPIX  =                    8 / 8-bit bytes                                    "

Looking at the data section, we can see it's binary encoded, which is no surprise since the XTENSION keyword was set to BINTABLE. I need access to the header information to be able to parse this data.


In [38]:
B8.take 20 bdata


">\153\153\154>\158\184R@\155\250\158>\158\184R>\163\215\n"

So, how do we handle the metadata section? Well, the FITS header can be viewed as a glorified key-value map, so let's use this as an API (this means that I lose ordering information, and informational keywords such as HISTORY and COMMENT, but they aren't needed to find out how to decode the data section). First I load up some useful modules, which include the Map API, the Foldable module, and the isSpace and catMaybes functions.


In [39]:
import qualified Data.Map as Map
import qualified Data.Foldable as Fold
import Data.Char (isSpace)
import Data.Maybe (catMaybes)

Next up is a helper function - rstrip - and one to strip out the (key,value) pairs from a card (splitCard).

The rstrip function takes advantage of the unsnoc function, which returns a bytestring containing all but the last character, together with that last character. The routine recursively calls itself until a non-whitespace character is found. I believe that the name unsnoc comes from a "play" on words, in that cons is used to describe adding an element to the start of a list (Lisp heritage), so that the act of adding an element to the end of a list is the reverse of cons, namely snoc. The un part then comes from the fact that here we are deconstructing, rather than creating, the string.

The splitCard function decides whether a card contains a key/value pair, returning Nothing if it doesn't or splitting up the card into its parts if it does (wrapping the result in Just). To make downstream processing easier, the key and value strings are "cleaned up" (trailing whitespace is removed, so it's fortuitous that I've just written a routine to do this) as well as converted from a ByteString to a String.

Now, at first glance, splitCard is inefficient; for example, it looks like key and val are always going to be evaluated, even when Nothing is returned. However, there are three points to consider:

  1. we shouldn't really be worrying about performance until it becomes obvious that it's a problem;

  2. thanks to Haskell's laziness, code is evaluated "on demand", so routines can have very-different performance characteristics compared to the same code written in a language like Python;

  3. the Haskell compiler can perform some rather impressive optimisations (although this is less relevant when running in an interactive environment like this IHaskell notebook).


In [40]:
-- | Remove trailing whitespace from a bytestring.
--
rstrip :: B8.ByteString -> B8.ByteString
rstrip bs = case B8.unsnoc bs of
  Just (nbs, c) -> if isSpace c then rstrip nbs else bs
  _ -> bs

-- | If a card has a keyword and value (i.e. is of the form
--   'keyword = value') return the two parts, after stripping
--   out trailing whitespace.
--
splitCard :: 
  Card 
  -> Maybe (String, String) -- ^ (keyword, value)
splitCard card = 
  let (l,r) = B8.splitAt 8 card
      (cs,v) = B8.splitAt 2 r
      key = rstrip l
      val = rstrip (B8.dropWhile isSpace v)
      sep = B8.pack "= "
  in if cs == sep
     then Just (B8.unpack key, B8.unpack val)
     else Nothing
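
As a quick, illustrative check of the unsnoc behavior that rstrip relies on (these calls are not part of the original analysis):

B8.unsnoc (B8.pack "ab ")
-- Just ("ab",' ')
B8.unsnoc B8.empty
-- Nothing
rstrip (B8.pack "END   ")
-- "END"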

We can see what splitCard does by giving it a card:


In [41]:
Seq.index units 1
splitCard (Seq.index units 1)


"BITPIX  =                    8 / 8-bit bytes                                    "
Just ("BITPIX","8 / 8-bit bytes")

The reason for converting to a (key,value) pair is that this is the format used to populate maps. The map1 map (or dictionary) - which we create below - contains one keyword, "testkey", whose value is "value". The Map.lookup function is used to query the map for a keyword: if it doesn't exist then Nothing is returned, otherwise the value is returned wrapped inside a Just.


In [42]:
map1 = Map.fromList [("testkey", "value")]

Map.lookup "42" map1
Map.lookup "testkey" map1


Nothing
Just "value"

The initial version of the conversion routine - i.e. to create the key/value map from a sequence of header cards - is given below, as cardsToMap2 (the awkward numeric suffix is because I'm "counting down" to the final version of the routine):


In [43]:
cardsToMap2 :: Seq.Seq Card -> Map.Map String String
cardsToMap2 cs = 
  let -- step 1
      xs :: [Card]
      xs = Fold.toList cs

      -- step 2
      ys :: [Maybe (String,String)]
      ys = map splitCard xs

      -- step 3
      zs :: [(String,String)]
      zs = catMaybes ys

  in Map.fromList zs -- step 4

The steps have been explicitly broken out and I have included the types of the terms for documentation. The four steps are:

  1. convert from a sequence into a list using Fold.toList (the Foldable module provides a number of generic routines useful for sequences - i.e. it somewhat contradicts my earlier comment that there's no common abstraction - but it is quite an "abstract" abstraction, so I have stayed away from using it in this notebook where possible);

  2. apply splitCard to each card;

  3. filter out all the Nothing entries created by splitCard and extract the values from the Just elements using the catMaybes function (which has a signature [Maybe a] -> [a]);

  4. convert this list into a map using the fromList function.

This function can also be written in one line:


In [44]:
cardsToMap1 :: Seq.Seq Card -> Map.Map String String
cardsToMap1 cs = Map.fromList (catMaybes (map splitCard (Fold.toList cs)))


Use mapMaybe
Found:
catMaybes (map splitCard (Fold.toList cs))
Why Not:
mapMaybe splitCard (Fold.toList cs)

which I only introduced as it will hopefully make it more obvious what the final version of the routine is doing (shown below). However, it also highlights a code simplification, since the pattern catMaybes (map f xs) can be replaced with mapMaybe f xs. The mapMaybe function has the signature


In [45]:
import Data.Maybe (mapMaybe)
:type mapMaybe


mapMaybe :: forall a b. (a -> Maybe b) -> [a] -> [b]

and will apply the function (here, splitCard) to every element of the list, throwing out all the Nothing values and extracting the successful values by removing the Just constructor. This modification was proffered by the Haskell linter (hlint), which also suggested the eta-reduction step earlier.

The final version of the conversion routine is:


In [46]:
cardsToMap :: Seq.Seq Card -> Map.Map String String
cardsToMap = Map.fromList . mapMaybe splitCard . Fold.toList

In this version I have used point-free syntax (which - for the purpose of this notebook - means without including parameter names), partial application, and function composition (the "." operator) to combine the steps. This is all a big digression, since the three versions of the routine all do the same thing, but I mention it here as this style is common in Haskell code.

The . operator is used in two ways in the function cardsToMap:

  1. to indicate the namespace of a symbol (e.g. that fromList is taken from the Map module);

  2. to combine functions.

In the second case, . is itself a function, with the following signature


In [47]:
:type (.)


(.) :: forall b c a. (b -> c) -> (a -> b) -> a -> c

That is, if f and g are single-argument functions, then f . g and g . f are both single-argument functions, where (f . g) x means the same as f (g x) and (g . f) x means g (f x). So, function composition is read right to left. I find that this syntax can help focus on the way information flows through a sequence of functions, but it can definitely be hard to read, and occasionally taken way past sensible extremes!
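
As a small, hypothetical example of that ordering (the names here are invented purely for illustration):

addOne :: Int -> Int
addOne = (+ 1)

showIncr :: Int -> String
showIncr = show . addOne   -- addOne is applied first, then show

showIncr 41
-- "42"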

It's time to stop the digressions and actually use the function:


In [48]:
hdr = cardsToMap units

The size function returns the number of keys in the map:


In [49]:
Map.size hdr


109

and I can use lookup on it:


In [50]:
Map.lookup "XTENSION" hdr
Map.lookup "DOESNOTEXIST" hdr


Just "'BINTABLE'           / binary table extension"
Nothing

Since I'm going to want to call lookup a few times - and I'm lazy - I want a helper routine that means I don't have to send in the header, and automatically removes the Just wrapper:


In [51]:
import Data.Maybe (fromJust)

getKey1 key = fromJust (Map.lookup key hdr)

Following my point-free mania, this can be "simplified" to:


In [52]:
getKey = fromJust . flip Map.lookup hdr

This uses the flip function to swap the argument order to lookup, so that I can use eta reduction to avoid naming the keyword variable, and fromJust to extract the value from the Just constructor. Now, fromJust is what is known as a partial function since it is not defined for all inputs: that is, if the input is a Nothing the routine will raise a run-time error. This is not normally considered "good form" in Haskell code, but I feel it's okay for this interactive exploration.
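
A more defensive alternative - just a sketch, since I'm sticking with getKey below - would be to supply a default value for missing keywords using fromMaybe, which avoids the partial fromJust:

import Data.Maybe (fromMaybe)

getKeyDef :: String -> String
getKeyDef key = fromMaybe "" (Map.lookup key hdr)

getKeyDef "DOESNOTEXIST"
-- ""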

Here's a run through of how the types work out when flip is used. As with many of the "tips" I've been showing here, it should be used with care!


In [53]:
:type Map.lookup
:type flip
:type flip Map.lookup
:type flip Map.lookup hdr
:type fromJust . flip Map.lookup hdr


Map.lookup :: forall k a. Ord k => k -> Map k a -> Maybe a
flip :: forall a b c. (a -> b -> c) -> b -> a -> c
flip Map.lookup :: forall a a1. Ord a => Map a a1 -> a -> Maybe a1
flip Map.lookup hdr :: String -> Maybe String
fromJust . flip Map.lookup hdr :: String -> String

In case you are interested, here is an example of getKey "blowing up" due to a missing keyword:


In [54]:
getKey "DOESNOTEXIST"


Maybe.fromJust: Nothing

For a FITS binary table, the important "structural" keywords - i.e. those that define the layout and size of the binary data - are:


In [55]:
getKey "BITPIX"
getKey "NAXIS"
getKey "NAXIS1"
getKey "NAXIS2"
getKey "TFIELDS"


"8 / 8-bit bytes"
"2 / 2-dimensional binary table"
"12 / width of table in bytes"
"1070 / number of rows in table"
"3 / number of fields in each row"

So, there are three fields - i.e. columns - in each row, each row is 12 bytes long, and there are 1070 rows of data. The field (column) information is stored in the following keywords:


In [56]:
getKey "TTYPE1"
getKey "TFORM1"
getKey "TUNIT1"


"'ENERG_LO'           / Min Energy"
"'1E      '           / format of field"
"'keV     '"

In [57]:
getKey "TTYPE2"
getKey "TFORM2"
getKey "TUNIT2"


"'ENERG_HI'           / Max Energy"
"'1E      '           / format of field"
"'keV     '"

In [58]:
getKey "TTYPE3"
getKey "TFORM3"
getKey "TUNIT3"


"'SPECRESP'           / Effective Area"
"'1E      '           / format of field"
"'cm**2   '"

So, the three columns are called ENERG_LO, ENERG_HI, and SPECRESP (the TTYPEn values); the data is represented as three 32-bit IEEE-754 floating-point numbers with no packing (the NAXIS1 and TFORMn values) and is in big-endian format$^\dagger$. The TUNITn keys give any units associated with the column, and are not needed to parse the data. It would be an interesting experiment to see if the unit value could be used to create a "numerically-typed" variable, as I used in my first three notebooks, and I may come back to this idea at some unspecified point in the future!

$^\dagger$ I'm sure the standard specifies this, but I just tried the various "endian" conversion routines until I got the answers I expected!

For this notebook I am going to use an "array of structures" representation - that is, create a record to represent each row - rather than a "structure of arrays". For this, I need a row type, such as

data Row = Row { energLo :: Float, energHi :: Float, specResp :: Float }

which uses the Haskell record syntax to allow the fields of the structure to be named (it's a bit like a C-style struct or Python's namedtuple).

However, I'm going to be a little more general than this, and allow the row to be parameterised by the data type used to store the data; that is:


In [59]:
data Row a = Row { energLo :: a, energHi :: a, specResp :: a } deriving (Eq, Show)

I've told the compiler to automatically create instances of the Eq and Show typeclasses for this type: the latter because it'll be useful when checking that the parsing has worked and the former for a sanity check I make later on, once I've read in all the data.


In [60]:
instance Functor Row where
  fmap f (Row elo ehi spec) = Row (f elo) (f ehi) (f spec)

I manually create a Functor instance because I'll want that when plotting up the results$^\dagger$. I could have got the compiler to derive this code too, by turning on the DeriveFunctor extension, but I felt it useful to show the code. The functor instance lets you apply a function to all elements of the row (so just like how map lets you apply a function to all elements of a list and it is how I applied splitCard to every element in the sequence as part of the cardsToMap function).

$^\dagger$ This isn't strictly necessary, as it doesn't save much code in this particular example, but why not!

As an example of this functor instance, let's create a row of float values:


In [61]:
testRow :: Row Float
testRow = Row 0.1 0.2 10.0

which can be displayed (taking advantage of the automatically-derived Show instance):


In [62]:
testRow


Row {energLo = 0.1, energHi = 0.2, specResp = 10.0}

The functor instance allows us to convert all the values to a string, by applying show to each field using fmap:


In [63]:
fmap show testRow
:type fmap show testRow


Row {energLo = "0.1", energHi = "0.2", specResp = "10.0"}
fmap show testRow :: Row String

or even keep the values with a Float type but changing their value - e.g. multiplying by two:


In [64]:
fmap (*2) testRow


Row {energLo = 0.2, energHi = 0.4, specResp = 20.0}

although I have to admit that it's hard to see the use in this last example for this particular data structure!

Now there are two "main" systems for converting to and from binary data: binary and cereal. I've chosen to use cereal because it has in-built support for IEEE-754 conversion.


In [65]:
import Data.Serialize.Get
import Data.Serialize.IEEE754

Let's start simply, by just parsing the first value (so the ENERG_LO column of the first row) and checking that it has a value of 0.3 (which is what dmlist reported its value to be). The simplest way to do this is to use runGet, which takes a parser and a bytestring, and returns the answer. For 32-bit floats in big-endian order, the parser is called getFloat32be:


In [66]:
runGet getFloat32be bdata


Right 0.3

Success! By manually moving through the binary data, I can extract the following elements, such as the ENERG_HI and SPECRESP values from the first row, and the ENERG_LO value from the second row:


In [67]:
runGet getFloat32be (B8.drop 4 bdata)
runGet getFloat32be (B8.drop 8 bdata)
runGet getFloat32be (B8.drop 12 bdata)


Right 0.31
Right 4.874343
Right 0.31

The values returned by runGet are preceded by Right, which is because the return value is actually Either String a, where a is the type of the parser (Float in this case). The Either type is - like the Maybe type - used to handle cases where two different values are needed. In the Maybe case this is Nothing (e.g. "failure") or Just a; in the Either case, the two values are Left a and Right b, where the types of a and b can be different. One of the common use cases for Either values is to handle calculations that can fail, where the "left side" has a value of Left String, for the error message, and the "right side" is used to indicate success, which is what we have here (Right a). So, we can ignore for now the Right prefixes, and concentrate on the numeric values. As a reminder, the dmlist output for the three rows was:

ROW    ENERG_LO             ENERG_HI             SPECRESP

     1     0.30000001192093     0.31000000238419         4.8743429184
     2     0.31000000238419     0.31999999284744        14.8292617798
     3     0.31999999284744     0.33000001311302        21.3022918701

which shows that I'm getting the right values (modulo differences in the display of floating-point values).
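
As a brief aside on the Either convention described above, here is a self-contained, made-up example (nothing to do with the FITS parsing) of using Left for failure and Right for success:

safeDiv :: Double -> Double -> Either String Double
safeDiv _ 0 = Left "division by zero"
safeDiv x y = Right (x / y)

safeDiv 1 4
-- Right 0.25
safeDiv 1 0
-- Left "division by zero"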

So, I could use this to parse each element at a time, manually stepping through the bytestring, but that seems like a lot of work. The next thing to try is the runGetState routine, which also returns the unparsed data. That is:


In [68]:
-- The last argument is the index into the input string at which to start parsing,
-- which is 0 here.
--
Right (v1, bdata1) = runGetState getFloat32be bdata 0
Right (v2, bdata2) = runGetState getFloat32be bdata1 0
Right (v3, bdata3) = runGetState getFloat32be bdata2 0
Right (v4, _)      = runGetState getFloat32be bdata3 0

(v1, v2, v3, v4)


(0.3,0.31,4.874343,0.31)

Here I've assumed that the parsing cannot fail, so that the return value is always of the form Right .... If it did fail then we would see the Irrefutable pattern error (shown earlier in the discussion of viewr) when one of the values was used (in this case, displaying the output of the v1..v4 tuple).

The code above can hardly be called elegant. One solution is to expand the "language" of the parser so that it can parse the things we are interested in - in this case a Row Float - leaving runGet (or runGetState) to deal with the boring parts. So, let's look at the type of the getFloat32be parser:


In [69]:
:type getFloat32be


getFloat32be :: Get Float

Hopefully this signature will be somewhat surprising (to the non-Haskellers reading this, anyway) since

  • what is this Get structure?

  • it doesn't seem to have any way to input the bytestring, nor does it have a way of returning the unparsed data!

Well, this is how Haskell deals with context-sensitive computations. In this case, "context" means "the remaining bytestring to parse", but it can also be a store of messages that log a computation, or some piece of state information that can be both read or changed (e.g. the arrangement of a plot) during a computation, or whether a computation has failed, or ... The most obvious context, which I've been using in these notebooks without mentioning it, is I/O (e.g. printing messages to the screen or reading the contents of the file). Haskell bases all these different sorts of computation on the same underlying mathematical abstraction. It can be thought of as a "mini language" inside Haskell, often recognizable by code written in the style

do
  a <- someFunc
  b <- anotherFunc a
  return (a+b)

which we have seen in the previous notebooks (and will see again below) when creating graphs. Note that return in Haskell can be confusing for people with experience in languages such as C, Fortran, or Python. For now, just be aware that return is one of those places that Haskell is different to other languages.

What the above means is that I can write a parser for a single row of data - which consists of three floats with no padding - with the following code (not that I'd expect you to be able to work this out from my description!):


In [70]:
getRow1 :: Get (Row Float)
getRow1 = do
  elo <- getFloat32be
  ehi <- getFloat32be
  spec <- getFloat32be
  return (Row elo ehi spec)

Checking this out with runGet gives:


In [71]:
runGet getRow1 bdata


Right (Row {energLo = 0.3, energHi = 0.31, specResp = 4.874343})

and


In [72]:
runGet getRow1 (B8.drop 12 bdata)


Right (Row {energLo = 0.31, energHi = 0.32, specResp = 14.829262})

Out of interest, here's what happens when the parsing fails (which in this case means "runs out of data", which I simulate by sending in fewer than twelve bytes):


In [73]:
runGet getRow1 (B8.take 4 bdata)


Left "too few bytes\nFrom:\tdemandInput\n\n"

The fact that I labelled this routine getRow1 might suggest to you that I'm going to re-write it, and you would not be wrong. Using yet-more mathematical abstractions - in this case applicative functors - the routine can be written more compactly as:


In [74]:
import Control.Applicative ((<$>), (<*>))

getRow :: Get (Row Float)
getRow = Row <$> getFloat32be <*> getFloat32be <*> getFloat32be

This version resembles the "point-free style" I described earlier, in that it avoids naming "temporary things". I mention it here in case you see these symbols (<$> and <*>) around and wonder what they are (plus, this is how I originally wrote the routine before deciding it was a tad cryptic to lead with). As an aside, GHC 7.10 came out whilst I was putting together this notebook. One of the changes in this release is known as the "AMP", which - for the purposes of this notebook - means that the code I've written may need small tweaks to compile without warning messages when using GHC version 7.10 or later.
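
To give a feel for these combinators outside of the Get parser, here is a hypothetical example using Maybe, which is also an applicative functor; the whole expression succeeds only if every component does:

(,,) <$> Just 1 <*> Just 2 <*> Just 3
-- Just (1,2,3)
(,,) <$> Just 1 <*> Nothing <*> Just 3
-- Nothing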

I can now build on getRow (or, equivalently, getRow1), to write a parser for multiple rows, by repeatedly calling getRow until the parser runs out of data. Fortunately there's a function - isEmpty - for this check. The resulting parser, which I based on code from the binary module documentation, is


In [75]:
getRows :: Get [Row Float]
getRows = do
  empty <- isEmpty
  if empty
    then return []
    else do
      row <- getRow
      rows <- getRows
      return (row:rows)

When run, it checks to see if there's any more data. If not, the result is the empty list, otherwise it parses the current row, then makes a recursive call (parsing the rest of the rows), before combining the two.

This time, when running the parser, I explicitly deal with the error case by converting it to an error, which will (hopefully) provide more useful information than an Irrefutable pattern error (although I don't plan on showing this feature off here ;-). In the success case I strip off the Right wrapper.


In [76]:
arows = case runGet getRows bdata of
  Left msg -> error msg
  Right v -> v

In [77]:
head arows
arows !! 1


Row {energLo = 0.3, energHi = 0.31, specResp = 4.874343}
Row {energLo = 0.31, energHi = 0.32, specResp = 14.829262}

So, yet-more success. Let's check that there are the correct number of rows (NAXIS2 was 1070):


In [78]:
length arows


1200

Oops. This is because the data blocks are chunked up into units of 2880 bytes, just like the header blocks, so there are excess rows, as 1070 times 12 bytes does not fall on a chunk boundary:


In [79]:
(1070 * 12) `divMod` chunk
(1200 * 12) `divMod` chunk


(4,1320)
(5,0)

In this particular case the extra values all decode to 0:


In [80]:
arows !! 1069
arows !! 1070
arows !! 1199


Row {energLo = 10.99, energHi = 11.0, specResp = 0.5657126}
Row {energLo = 0.0, energHi = 0.0, specResp = 0.0}
Row {energLo = 0.0, energHi = 0.0, specResp = 0.0}

If I were doing this properly, I'd send in the correct number of bytes to getRows (in this case NAXIS1 times NAXIS2 bytes) but for now I can just drop the invalid rows:


In [81]:
rows = take 1070 arows

head rows
rows !! 1
rows !! 2
rows !! 3
rows !! 4


Row {energLo = 0.3, energHi = 0.31, specResp = 4.874343}
Row {energLo = 0.31, energHi = 0.32, specResp = 14.829262}
Row {energLo = 0.32, energHi = 0.33, specResp = 21.302292}
Row {energLo = 0.33, energHi = 0.34, specResp = 28.514952}
Row {energLo = 0.34, energHi = 0.35, specResp = 35.39883}
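
For reference, here is a sketch of the "proper" approach mentioned above: either restrict the parser to NAXIS1 * NAXIS2 bytes (hard-coded here as 12 * 1070 rather than read from hdr), or ask for exactly that many rows using replicateM; both should produce the 1070 valid rows with no padding entries.

import Control.Monad (replicateM)

-- limit the input to NAXIS1 * NAXIS2 bytes ...
runGet getRows (B8.take (12 * 1070) bdata)

-- ... or run the row parser a fixed number of times
runGet (replicateM 1070 getRow) bdata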

It looks like the bins are consecutive; that is, the energHi field of row n is the same as the energLo field of row n+1. To see whether this holds for all rows, I extract the contents of the "low" and "high" fields with calls to map, remove the first element of the "low" array (with tail, which is like Python's [1:] array slice), the last element of the "high" array (with init, which is Python's [:-1] slice), and then check that the resulting arrays are equal. This only works because of the Eq instance of Row that I earlier asked the compiler to derive for me (that is, == can deal with type [a] - a list of a's - if it knows how to compare a's).


In [82]:
elos = map energLo rows
ehis = map energHi rows

tail elos == init ehis


True

Finally, it's time to make a plot of the data. I load in the necessary code to ensure that the plots can get displayed in-line (this just re-uses the code I created in previous notebooks):


In [83]:
import IHaskell.Display
import Graphics.Rendering.Chart.Backend.Diagrams

-- Apparently |> is also defined in the Easy module, so, as I have already imported
-- it from Data.Sequence, I explicitly avoid importing it here.
--
import Graphics.Rendering.Chart.Easy hiding ((|>))

instance IHaskellDisplay (Renderable a) where
  display renderable = renderableToSVG renderable 450 300 >>= display . fst

Now I create a function which takes a list of rows and creates a plot. Since the data is binned - in that the SPECRESP column is defined over an energy range - I want to display it as a histogram. Now, there is support for drawing histograms in Chart (e.g. PlotBars) but I don't find the interface intuitive, so I am going to create a plot manually$^\dagger$. I can do this by drawing a line that connects (elo_0,specresp_0), (ehi_0,specresp_0), (elo_1,specresp_1), (ehi_1,specresp_1), ... (elo_1069,specresp_1069), (ehi_1069,specresp_1069). Ideally the list would start at (elo_0,0) and end at (ehi_1069,0), but I'll leave that for now. The code also takes advantage of the Functor instance of the Row type that I wrote earlier, to convert the storage type from Float to Double (the reason for this is given in the comments).

Note that the plot is defined using "do" notation, although this time there are no obvious calls to "set" variables (i.e. nothing of the form a <- function). This is because the Chart API uses the lens package to greatly reduce the amount of code needed to set the various fields it uses (such as the various "titles" which I change below). If you're interested in trying out Chart, I would use the Chart examples as a basis, before looking too deeply into how to use lens.

$^\dagger$ pointers to existing code that already does this are most welcome.


In [84]:
drawARF ins = toRenderable (do
    layout_title .= "ARF"
    layout_x_axis . laxis_title .= "E (keV)"
    layout_y_axis . laxis_title .= "cm^2"
    plot (line "" [cs])
    )
  where
    -- At present there's no PlotValue instance for Float types, which means
    -- that I can't just send a list of Float values to line (I have a
    -- PR on GitHub to address this issue at
    -- https://github.com/timbod7/haskell-chart/pull/77,
    -- which has now been accepted, but rather than re-build with that
    -- version, the easiest solution is to convert the Float values
    -- to Double. I could do this manually, but as I have a Functor instance
    -- on the Row type, I can convert Row Float to Row Double just with a
    -- call to "fmap realToFrac".
    --
    rs = map (fmap realToFrac) ins
    
    (elos, ehis, arfs) = unzip3 (map (\(Row elo ehi arf) -> (elo,ehi,arf)) rs)
    xs = concat (zipWith (\a b -> [a,b]) elos ehis)
    ys = concatMap (\a -> [a,a]) arfs
    cs = zip xs ys

Note that lists also have a Functor instance, whose fmap is just map, which means that I could have said

    rs = fmap (fmap realToFrac) ins

but while the compiler can work out which fmap it needs to use, for this example I decided to be explicit!

Displaying the ARF

The moment of truth: what does the ARF look like?


In [85]:
drawARF rows


When discussing the k correction, the spectral models were in units of photon/cm$^2$/s/keV. This measures the flux of photons from a source (photon/s) per energy (the /keV term) and per unit area (the /cm$^2$ term). The ARF is given in units of area (cm$^2$), so if you multiply it by the spectral model you get a spectrum in units of photon/s/keV, which - when integrated up over an energy range - gives a flux in units of photon/s. A physical interpretation is that the ARF measures the sensitivity of the telescope as a function of energy. The fact that the ARF is not a constant, and in fact shows significant structure such as a large edge at 2 keV, is one reason why understanding X-ray data (that is, comparing models to data), is not simple.
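
Schematically (this formula is just a restatement of the description above, and it ignores the detector's energy response): if $S(E)$ is the spectral model in photon/cm$^2$/s/keV and $A(E)$ is the ARF in cm$^2$, then the predicted photon rate over an energy band is $R = \int A(E)\,S(E)\,dE$, in units of photon/s.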

The Chandra mirrors are large - "human" size, with a mass of about 1500 kg - but, because this is an X-ray telescope, they are not arranged as they would be in an optical telescope (such as the Ritchey–Chrétien design used in the Hubble Space Telescope) which reflects light, but as four concentric cylinders (known as a Wolter Type-I design) which uses grazing-incidence reflection to focus the light:

Image credits: NASA/CXC/D.Berry, available at http://chandra.harvard.edu/resources/illustrations/teleSchem.html

The cylinders have diameters ranging from 65 cm to 1.23 m, but an area of 400 cm$^2$ is equivalent (when thinking about an area of sky measuring the flux) to a circle of radius 11 cm:


In [86]:
sqrt (400 / pi)


11.283791670955125

This can lead to jealousy when talking to our optical colleagues, who can talk about using telescopes with radii of 4 m and above, and there are plans for 15 m radii in the near future! In defense of the people who build X-ray mirrors, optical telescopes do not use the full mirror area (M$_1$), since the secondary mirror (M$_2$) obstructs some light, and there's generally a big hole in the center of the mirror, in part, because the secondary mirror occludes this area, but mainly because this is where the light is reflected down to the tertiary mirror and/or instruments ($F$ in the image below)!

Image credits: By HHahn (Own work) [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0) or GFDL (http://www.gnu.org/copyleft/fdl.html)], via Wikimedia Commons

So, I guess the summary is: don't judge an Astronomer by the size of her mirror!

A different way to display the ARF

As I was writing this notebook, I got to thinking about other ways to display the data. One obvious choice was a HTML table, which gave me an excuse to play a bit more with the IHaskell display code. Fortunately, I have already installed the necessary Haskell packages - in this case blaze-html - so I can go straight to work. The routines from the Html5 module are going to do all the heavy lifting of creating the HTML output; all I need to do is put them together in the right way.


In [87]:
import Text.Blaze.Html5 hiding (head, map)
import Data.Monoid ((<>), mconcat)

As well as the Html code, I have also loaded the <> operator from the Data.Monoid module. The Monoid typeclass$^\dagger$ represents things that can be combined together - that is, things that have an associative operator, written <> (well, it actually means a bit more than that, but all I need is the "combine" operator, which is also known as mappend). The mconcat function lets you combine a list of values using <>; that is, mconcat [a_0, a_1, a_2, ... a_n] is the same as a_0 <> a_1 <> a_2 <> ... <> a_n.

$^\dagger$ 10 points to the person in the back of the class - yes, you over there in the floral dress - who just mumbled "oh what, Yet Another Mathematical Abstraction In Haskell!".

As an example, lists can be concatenated together with ++, but this form is specific to lists. The generic version - since a list has a Monoid instance - is <>; that is, the following two are the same:


In [88]:
[1,3] ++ [2,4]
[1,3] <> [2,4]


[1,3,2,4]
[1,3,2,4]
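
And mconcat, mentioned above, just combines a whole list of values with <> (an illustrative call, not part of the table-building code):

mconcat [[1,3], [2,4], [5]]
-- [1,3,2,4,5]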

To start off, I want to try converting a single Row into a row of a table (i.e. create a <tr> element containing the column data in <td> elements). To do this, each element of the row is

  • converted to Html using toHtml

  • and then converted to a <td> element, using the appropriately-named td function.

The three <td> elements are combined together using <>, which appends them to form <td>...</td><td>...</td><td>...</td>. These are then placed inside a <tr> element.


In [89]:
rowToHtml (Row elo ehi sp) = 
  let col val = td (toHtml val)
  in tr (col elo <> col ehi <> col sp)

Now, the toHtml function has an interesting signature:


In [90]:
:type toHtml


toHtml :: forall a. ToMarkup a => a -> Html

which says that it can be given any type, as long as it is an instance of the ToMarkup typeclass. This constraint - indicated by the text to the left of the => in the signature - "infects" the rowToHtml signature, since we have:


In [91]:
:type rowToHtml


rowToHtml :: forall a. ToMarkup a => Row a -> Html

This makes sense, since it says that rowToHtml can only be used on a Row whose type can be converted to Html. Fortunately the Float type is an instance of the ToMarkup class. You can find this out from the documentation, either of the ToMarkup class or Float - e.g. (the :opt line is just to tell the notebook to include the information inline, so it's visible in the HTML version of the page, but it appears that it may not always display particularly well):


In [92]:
:opt no-pager

In [93]:
:info Float


However, the easiest way is to just try it and see if it compiles, or you get an error back (for example, earlier when I tried len / 2880 and got an error about a missing Fractional constraint)!

Applying it to the last row produces:


In [94]:
rowToHtml (last rows)


10.99 11.0 0.5657126

Since the Html type is an instance of the IHaskellDisplay type class, the notebook has automatically rendered this for us, which is why we don't see the <td> and <tr> elements.
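If you do want to see the underlying markup rather than the rendered version, blaze-html also ships renderers that turn Html into a plain String. Something like the following should do it (a sketch, not part of the original notebook; renderHtml here comes from the Text.Blaze.Html.Renderer.String module):

import Text.Blaze.Html.Renderer.String (renderHtml)

-- print the raw markup instead of letting the notebook render it
putStrLn (renderHtml (rowToHtml (last rows)))

which should print the <tr> and <td> elements directly.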

In fact, I can make the Row type an instance of the ToMarkup typeclass using this function:


In [95]:
instance ToMarkup a => ToMarkup (Row a) where
  toMarkup = rowToHtml

which means that I can now say:


In [96]:
toMarkup (last rows)


10.99 11.0 0.5657126

Due to the way the code is set up, if you can call toMarkup on a type you can also call toHtml:


In [97]:
toHtml (last rows)


10.99 11.0 0.5657126

What happens if I try to convert several rows?


In [98]:
toMarkup (take 3 rows)


No instance for (ToMarkup [Row Float]) arising from a use of ‘toMarkup’
In the expression: toMarkup (take 3 rows)
In an equation for ‘it’: it = toMarkup (take 3 rows)

In other words, there is no instance of ToMarkup a => ToMarkup [a], which makes sense: just because I know how to convert a single element to Html does not mean there is a single obvious way to combine a list of them. This suggests that, semantically, I want a data type that represents a "table", so that it makes sense to say "oh, when converting this to Html I want to include a table header as well as the data".

This leads me to defining an ARF type, based on a list of Row values (I am also going to restrict the row type to be Float):


In [99]:
newtype ARF = ARF [Row Float]

A value with a type of ARF is created by using the ARF label (constructor), but as shown below, the notebook doesn't know what to do with it (because I did not tell the compiler to create a Show instance for the type, as I did with the Row type):


In [100]:
ARF rows


Unshowable: ARF
No instance for (Show ARF) arising from a use of ‘print’
In a stmt of an interactive GHCi command: print it

A Show instance could be added easily - for instance:

instance Show ARF where
  show (ARF rows) = "ARF " ++ show rows

but we're here to create a HTML table! Since there are a lot of rows - over a thousand of them - I want to be able to show only a subset of them in this table. One simple interface is to use a Maybe Int as an indicator:

  • Nothing means to display all rows
  • Just n means only show the first and last n rows

although other options are possible. This suggests the following function, which splits an input list up into the start and end subsets, if a subset is asked for (i.e. the count is Just n) and the list is long enough. In this case I use the Either type to return a value, not because the Left case should be considered a failure, but because I want to return either the full list (Left) or the start and end subsets (Right). The reason for splitting the two cases will hopefully become clear soon$^\dagger$.

$^\dagger$ If you can't wait, it's because I want to add a row of "..." characters to indicate when rows have been excluded.


In [101]:
subsetRows :: 
  Maybe Int  -- ^ how many rows to select from start and end, if any 
  -> [a]     -- ^ the input list
  -> Either [a] ([a],[a])
subsetRows mcount xs =
  let n = length xs
  in case mcount of
    Just nr | n > 2 * nr -> let start = take nr xs
                                end = drop (n-nr) xs
                            in Right (start, end)
    _ -> Left xs

There are actually two cases above that result in Left xs being returned:

  • mcount is Just nr, but n <= 2 * nr
  • mcount is Nothing

I use the "catch-all" _ case to handle both of these (this is in contrast to the earlier case when I called viewr but explicitly named both patterns of the case statement).

Checking this gives the expected results (or at least, I hope you expect them):


In [102]:
subsetRows Nothing [1..10]


Left [1,2,3,4,5,6,7,8,9,10]

In [103]:
subsetRows (Just 3) [1..10]


Right ([1,2,3],[8,9,10])
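As an aside, the two Left cases that the catch-all _ pattern handles could also be written out explicitly; here's a sketch (the name subsetRowsExplicit is mine, not from the notebook):

subsetRowsExplicit :: Maybe Int -> [a] -> Either [a] ([a], [a])
subsetRowsExplicit mcount xs =
  let n = length xs
  in case mcount of
       -- no subset requested: return the full list
       Nothing -> Left xs
       -- a subset was requested, but only split if the list is long enough
       Just nr
         | n > 2 * nr -> Right (take nr xs, drop (n - nr) xs)
         | otherwise  -> Left xs

Both versions behave the same; which one reads better is largely a matter of taste.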

In subsetRows I did not assume anything about the contents of the list (in other words, the list type a was not constrained). This is often "a good thing™", since it means that

  • the code is more re-usable
  • you can be more certain what the code does (or, perhaps more importantly, what it cannot do).

That is, with a type like [a] -> [a], we know that the function can only return the empty list or values drawn from the input list (possibly reordered or repeated), but it cannot return a value that was not in the input list (since, without knowing the type, it cannot "create" a value). Compare this to a function with the type [Int] -> [Int], which is free to, say, replace every even value by 2 and every odd value by the length of the list.
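To make that concrete, here is a (deliberately silly) function with the specific type - a sketch, with a made-up name - that does exactly that; no function of type [a] -> [a] could do this, since it has no way to conjure up the value 2 or use a length as an element:

-- free to invent values, because it knows the elements are Ints
silly :: [Int] -> [Int]
silly xs = map pick xs
  where pick x | even x    = 2
               | otherwise = length xs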

Eventually I do need to write a function with a specific type! In this case, I want something that will create the <tr> elements for the HTML table, marking any excluded rows by using a row of ellipses - i.e. a row like


In [104]:
toHtml (Row "..." "..." "...")


... ... ...

but using the Unicode ellipsis character (…) rather than three separate dots.

Using subsetRows, I can convert the selected rows to HTML with toHtml, and insert a row of ellipses if the return value was Right. I could do this just for Row Float, but let's keep trying to be general and allow any row with a ToMarkup-compatible type:


In [105]:
htmlSubset :: ToMarkup a => Maybe Int -> [Row a] -> Html
htmlSubset mcount xs = 
  let out = case subsetRows mcount xs of
        Left ys -> map toHtml ys
        Right (ls,rs) -> let hellip = "…"
                             spacer = toHtml (Row hellip hellip hellip)
                         in map toHtml ls <> [spacer] <> map toHtml rs 
  in mconcat out

I used <> to combine the lists but I could have also used ++ here. Since each element of out is Html, and Html is a Monoid, I can use mconcat to combine them together, so that the return is Html rather than [Html].

There's a subtle point in the above for the newcomer to Haskell: why did I write

map toHtml ls <> [spacer] <> map toHtml rs

rather than

map toHtml (ls <> [spacer] <> rs)

that is, combine ls, spacer, and rs into a list first? The reason is that the types don't match up: ls and rs both have the type [Row a], but spacer is the output of a toHtml call, and so has the type Html. The compiler knows this, and won't let you combine lists of different types. So, I convert the ls and rs values to Html first, and then I can combine the three terms.

As a check, it works (although it isn't displayed particularly nicely, as there is no <table> around the <tr>'s created by the routine):


In [106]:
htmlSubset (Just 2) rows


0.3 0.31 4.874343 0.31 0.32 14.829262 … … … 10.98 10.99 0.5852773 10.99 11.0 0.5657126

The "header" row can be created manually (I'd probably not pull this out into a separate routine, other than in a notebook like this). It's somewhat ugly since there's a lot of type conversion going on (i.e. all the calls to toHtml). There are ways to avoid this - in particular the OverloadedStrings GHC extension, but there are times when it's worth being explicit (if only to point out where some of the "warts" in a lanugage are):


In [107]:
arfHeader :: Html
arfHeader =   
  let c1 = toHtml "E" <> sub (toHtml "low") <> toHtml " (keV)"
      c2 = toHtml "E" <> sub (toHtml "high") <> toHtml " (keV)"
      c3 = toHtml "ARF (cm" <> sup (toHtml "2") <> toHtml ")"
  in tr (th c1 <> th c2 <> th c3)
  
arfHeader


E_low (keV)  E_high (keV)  ARF (cm^2)
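For comparison, with the OverloadedStrings extension enabled, string literals can be used directly as Html values and the explicit toHtml calls disappear. A sketch (not from the original notebook; in a notebook the extension would be turned on with something like :set -XOverloadedStrings rather than a pragma):

{-# LANGUAGE OverloadedStrings #-}

arfHeader' :: Html
arfHeader' =
  let c1 = "E" <> sub "low" <> " (keV)"
      c2 = "E" <> sub "high" <> " (keV)"
      c3 = "ARF (cm" <> sup "2" <> ")"
  in tr (th c1 <> th c2 <> th c3)

I'll stick with the explicit arfHeader in what follows.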

These can then be put together to create a table:


In [108]:
-- | Convert a list of rows into a HTML table.
arfToHtml ::
  Maybe Int   -- ^ if `Nothing`, display all rows, otherwise just this number at the start and end
  -> ARF      -- ^ data to display
  -> Html     -- ^ HTML representation
arfToHtml mcount (ARF rows) =
  table (thead arfHeader <> tbody (htmlSubset mcount rows))

Testing on a subset of the rows gives (the output may not appear quite correctly, in that horizontal and vertical lines may be missing, depending on how you are viewing this notebook):


In [109]:
arfToHtml Nothing (ARF (take 5 rows))


E_low (keV)  E_high (keV)  ARF (cm^2)
0.3 0.31 4.874343
0.31 0.32 14.829262
0.32 0.33 21.302292
0.33 0.34 28.514952
0.34 0.35 35.39883

In [110]:
arfToHtml (Just 1) (ARF (take 5 rows))


E_low (keV)  E_high (keV)  ARF (cm^2)
0.3 0.31 4.874343
0.34 0.35 35.39883

With this routine, I can finally create an instance of IHaskellDisplay for the ARF type, where I choose to show only the first and last 3 rows:


In [111]:
instance IHaskellDisplay ARF where
  display = display . arfToHtml (Just 3)

and use this to display the ARF in tabular form:


In [112]:
ARF rows


E_low (keV)  E_high (keV)  ARF (cm^2)
0.3 0.31 4.874343
0.31 0.32 14.829262
0.32 0.33 21.302292
10.97 10.98 0.6049552
10.98 10.99 0.5852773
10.99 11.0 0.5657126

This could be extended to include metadata about the ARF, but that would require extracting the relevant information from the header (i.e. parsing the value strings to separate out the data from the comment string), and enhancing the ARF data type to store this information. Perhaps next time...
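If I do get around to it, one possible shape for such an extended type might be something like the following sketch (the field names and header keywords here are my own illustrative choices):

-- A hypothetical extension of ARF that also carries some header metadata.
data ARFMeta = ARFMeta
  { arfTelescope  :: String       -- e.g. the TELESCOP keyword
  , arfInstrument :: String       -- e.g. the INSTRUME keyword
  , arfData       :: [Row Float]  -- the ENERG_LO/ENERG_HI/SPECRESP values
  }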

The end

There you go; I hope you enjoyed it. If you have any questions, then please use the GitHub issues page or contact me on Twitter at https://twitter.com/doug_burke.